
    Impact of filter feature selection on classification: an empirical study

    The high dimensionality of Big Data poses challenges for data understanding and visualization. It also leads to lengthy model-building times in data analysis and poor generalization for machine learning models. Feature selection addresses these problems by identifying the most relevant parts of the data, enabling simpler and more understandable models with reduced training time and improved model performance. This study aims to (i) characterize the factors (i.e., dataset characteristics) that influence the performance of feature selection methods, and (ii) assess the impact of feature selection on the training time and accuracy of binary and multiclass classification. We propose a systematic method for selecting representative datasets from a given repository (i.e., considering the distributions of several dataset characteristics). We then provide an empirical study of the impact of eight feature selection methods on Naive Bayes (NB), K-Nearest Neighbors (KNN), Linear Discriminant Analysis (LDA), and Multilayer Perceptron (MLP) classification algorithms, using 32 real-world datasets and a relative performance measure. We observed that feature selection is more effective at reducing training time (e.g., by up to 60% for LDA classifiers) than at improving classification accuracy (e.g., by up to 5%). Furthermore, feature selection gave slight accuracy improvements for binary classification (up to 5%), while it mostly degraded accuracy for multiclass classification. Although no single feature selection method was best in all cases, for multiclass classification the correlation-based and minimum redundancy maximum relevance (mRMR) methods gave the best accuracy. Through statistical testing, we found that LDA and MLP benefit more from feature selection in accuracy improvement than KNN and NB.
    The project leading to this publication has received funding from the European Commission under the European Union's Horizon 2020 research and innovation programme (grant agreement No 955895).
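    As an illustration of the filter-then-classify pipeline this abstract describes, below is a minimal sketch using scikit-learn. It is not the paper's exact setup: the synthetic dataset, the mutual-information filter score, and the choice of k=20 features are assumptions for illustration, standing in for the study's eight filter methods and 32 real-world datasets.

# Minimal sketch (not the paper's pipeline): filter feature selection
# before classification, comparing training time and accuracy with and
# without selection. Dataset, filter score, and k are illustrative.
import time

from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SelectKBest, mutual_info_classif
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier
from sklearn.neural_network import MLPClassifier

# Synthetic high-dimensional binary classification problem.
X, y = make_classification(n_samples=2000, n_features=200,
                           n_informative=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0)

# Filter step: rank features by mutual information with the label and
# keep the top 20, independently of any classifier (a filter method).
selector = SelectKBest(mutual_info_classif, k=20).fit(X_train, y_train)
X_train_fs = selector.transform(X_train)
X_test_fs = selector.transform(X_test)

for name, clf in [("NB", GaussianNB()),
                  ("KNN", KNeighborsClassifier()),
                  ("LDA", LinearDiscriminantAnalysis()),
                  ("MLP", MLPClassifier(max_iter=300, random_state=0))]:
    for label, (Xtr, Xte) in [("all features", (X_train, X_test)),
                              ("selected", (X_train_fs, X_test_fs))]:
        start = time.perf_counter()
        clf.fit(Xtr, y_train)
        elapsed = time.perf_counter() - start
        print(f"{name:4s} {label:12s} train time {elapsed:.3f}s  "
              f"accuracy {clf.score(Xte, y_test):.3f}")

    Because the filter score is computed once and reused for every classifier, the selection cost is amortized across learners; this independence from the downstream model is what distinguishes filter methods from the wrapper methods evaluated in the next output.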

    Wrapper methods for multi-objective feature selection

    The ongoing data boom has democratized the use of data for improved decision-making. Beyond gathering voluminous data, preprocessing is crucial to ensure that the data's most relevant aspects are considered during analysis. Feature Selection (FS) is an integral step in data preprocessing for reducing data dimensionality while preserving the most relevant features of the data. FS can be performed by inspecting inherent associations among the features in the data (filter methods) or by using the model performance of a concrete learning algorithm (wrapper methods). In this work, we extensively evaluate a set of FS methods on 32 datasets and measure their effect on model performance, stability, scalability, and memory usage. The results re-establish the superiority of wrapper methods over filter methods in model performance. We further investigate the unique role of wrapper methods in multi-objective FS, focusing on two traditional metrics: accuracy and Area Under the ROC Curve (AUC). On model performance, our experiments showed that optimizing for both metrics simultaneously, rather than for a single metric, improved the accuracy and AUC trade-off by up to 5% and 10%, respectively.
    The project leading to this publication has received funding from the European Commission under the European Union's Horizon 2020 research and innovation programme (grant agreement No 955895). Besim Bilalli is partly supported by the Spanish Ministerio de Ciencia e Innovación, as well as the European Union-Next Generation EU, under project FJC 2021-046606-I/AEI/10.13039/501100011033. Gianluca Bontempi was supported by Service Public de Wallonie Recherche under grant no. 2010235-ARIAC by DIGITALWALLONIA4.AI.
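    To make the wrapper idea concrete, below is a minimal sketch of forward selection that evaluates each candidate feature subset by fitting a learner and scoring both accuracy and AUC. It is an assumption-laden simplification: the paper's actual multi-objective search is more sophisticated, and the logistic regression learner, synthetic dataset, and greedy sum-of-objectives step are illustrative choices, not the authors' method.

# Minimal sketch of wrapper-style forward selection tracking two
# objectives (accuracy and AUC). The learner, dataset, and greedy
# scalarization are illustrative assumptions, not the paper's search.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, n_features=30,
                           n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3,
                                          random_state=0)

def evaluate(features):
    """Wrapper evaluation: fit the learner on a feature subset and
    return (accuracy, AUC) on held-out data."""
    clf = LogisticRegression(max_iter=1000).fit(X_tr[:, features], y_tr)
    proba = clf.predict_proba(X_te[:, features])[:, 1]
    return (accuracy_score(y_te, clf.predict(X_te[:, features])),
            roc_auc_score(y_te, proba))

selected, history = [], []
remaining = list(range(X.shape[1]))
for _ in range(10):  # grow the subset up to 10 features
    # Greedy step: add the feature maximizing the sum of both
    # objectives (a simple scalarization of the two metrics).
    scores = [(evaluate(selected + [f]), f) for f in remaining]
    (acc, auc), best = max(scores, key=lambda s: s[0][0] + s[0][1])
    selected.append(best)
    remaining.remove(best)
    history.append((len(selected), acc, auc))

for n, acc, auc in history:
    print(f"{n:2d} features: accuracy={acc:.3f}  AUC={auc:.3f}")

    Note the cost profile this implies: every candidate subset triggers a model fit, which is why the abstract evaluates wrappers on scalability and memory usage alongside model performance.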